Using the Advanced PDF Object

The Advanced PDF object in Advanced Process Automation enables you to capture text from a PDF document using an advanced OCR engine.

To capture text, images must have a resolution of at least 200 dpi, with a contrast of 50% brightness or more.

Each PDF file that you load cannot be larger than 2 GB.
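The limits above can be checked before a file is passed to the Advanced PDF object. The following Python sketch is an illustration only and is not part of Advanced Process Automation: the function names are hypothetical, "2 GB" is assumed to mean 2 × 1024³ bytes, and the resolution and brightness values are assumed to come from whatever scanning or imaging tool produced the document.

```python
import os

# Limits taken from the requirements above.
# Assumption: "2 GB" is read as 2 * 1024**3 bytes.
MAX_PDF_BYTES = 2 * 1024**3
MIN_DPI = 200
MIN_BRIGHTNESS_PERCENT = 50


def size_within_limit(size_bytes: int) -> bool:
    """Return True if a file of this size is not larger than the limit."""
    return size_bytes <= MAX_PDF_BYTES


def pdf_size_ok(path: str) -> bool:
    """Return True if the file at `path` is not larger than the limit."""
    return size_within_limit(os.path.getsize(path))


def image_ok(dpi: float, brightness_percent: float) -> bool:
    """Return True if the image meets the minimum resolution and brightness."""
    return dpi >= MIN_DPI and brightness_percent >= MIN_BRIGHTNESS_PERCENT
```

For example, `image_ok(300, 80)` passes both checks, while `image_ok(150, 80)` fails on resolution.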

OCR functionality is included in the Advanced PDF object, which can be found in the Direct.Vsd.Library. Use the Advanced PDF object only for getting text and tables from a PDF.

The following languages are supported: Arabic, Armenian, Azeri, Bashkir, Bulgarian, Catalan, Croatian, Czech, Danish, Dutch, English, Estonian, Farsi, Finnish, French, German, Greek, Hebrew, Hungarian, Indonesian, Italian, Japanese, Korean, Latin, Latvian, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, Tatar, Thai, Turkish, Ukrainian, Vietnamese.

Advanced PDF Object Functionality

You can review the available functions of the Advanced PDF object from the Direct.Vsd.Library in the Real-Time Designer.

To view Direct.Vsd.Library functionality:

1. In the Real-Time Designer, select the Project tab.
2. In the Reference section, browse to Library References > Direct.Vsd.Library.
3. In the Functionality tab, from the Type drop-down list, select Advanced PDF.

The following properties are available:

Property	Description
Active Page	The number of the page in the PDF from which you want to get data. By default, this is the first page of the PDF.
Block Count	Gets the number of OCR blocks from this PDF page.
File Name	The full path to the PDF file, for example: C:\Data\Sample1.pdf
Handwriting Identification Mode	Indicates the name of the Handwriting style, when OCR Mode is set to Handwriting. Otherwise, the value of this property is empty.
Languages	The list of recognition languages. The default is English.
OCR Mode	The OCR mode being used by the OCR engine.
Pages Count	The number of pages in the PDF document.
Tables Count	The number of recognized tables on the active page. This property returns the number of tables only after the table data has been recognized, that is, after the first run of a function that gets text from a table using a table index that is known in advance.

The following functions are available:

Function	Description
Crop Image	Crops the image from the PDF active page and retrieves a new Advanced Picture object.
Determine Brightness	Determines the brightness, as a percentage, of an area of the Advanced PDF object specified by coordinates (100 is white).
Get Block Words	Recognizes text and returns a VSD Word Data Object with a collection of words and rectangles for a specific OCR Text Block.
Get Checkmark State	Returns the state of the checkmark for a given rectangle and checkbox type (Square or Circle). Possible return values are Checked, NotChecked, Corrected or Not Recognized. If OCR is not installed, NotDetected is returned. This function is new in version 7.2. See Using the OCR Get Checkmark State Function.
Get OCR Text Block	Returns the recognized text from the specified OCR block by its index.
Get Page Text	Gets text from the active PDF page.
Get Suspicious Data	Returns a list of OCR Suspicious Data Objects. Each object contains an OCR paragraph, with its coordinates and text, in which one or more words are suspicious. The user can then identify the suspicious words from the text of the returned paragraph.
Get Table	Gets the text from the specified table as a list of lists of text: one list per row, with one text entry per cell in that row.
Get Table Cells Rectangle	Gets the list of coordinates of the rectangle that encloses the cells of the specified table.
Get Table Cells Text	Gets the list of text for each cell from the specified table.
Get Word Locations	Gets a list of coordinates of the rectangles that enclose the given word. This function is new in version 7.2. See Using the OCR Get Word Location Function.
Get Words	Recognizes text and returns a VSD Word Data Object with a collection of words and rectangles.
Load PDF Page Collection	Loads the pages whose page numbers are specified from a PDF file. Only the specified pages will be loaded from the PDF. Use this function to reduce processing time when your PDF is very large.
Load PDF Page Range	Loads a specified number of pages from a PDF, beginning from a designated start page. Only the specified pages will be loaded from the PDF. Use this function to reduce processing time when your PDF is very large.
PDF Active Page To Picture	Transforms the PDF Active Page to an Advanced Picture object.
Set Languages	Sets the recognition languages from a predefined list of supported languages.
Set OCR Mode	Sets the OCR mode used by the OCR engine. If the Default option does not work, you can try one of the following options:
Text (Speed): Use this option for a text document where speed is important. Faster than Text (Accuracy), but potentially less exact.
Text (Accuracy): Use this option for a text document where accuracy is important. Slower than Text (Speed), but potentially more exact.
Document (Speed): Use this option for a text document with objects such as tables where speed is important. Faster than Document (Accuracy), but potentially less exact.
Document (Accuracy): Use this option for a text document with objects such as tables where accuracy is important. Slower than Document (Speed), but potentially more exact.
Barcode (Speed): Use this option for a barcode where speed is important. Faster than Barcode (Accuracy), but potentially less exact.
Barcode (Accuracy): Use this option for a barcode where accuracy is important. Slower than Barcode (Speed), but potentially more exact.
Business Cards: Use this option for business cards.
Engineering Drawings: Use this option for drawings and charts.
One Block: Also called Field Level Recognition. Use this option to OCR a single block on the screen. Use it together with controls or with crop functions to demarcate the block.
Handwriting: Use this option for handwritten content. For the Handwriting option, you can choose between different handwriting styles. Select Handwriting, and then the style. Hover over a style to see a tooltip with a description.
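
The guidance for choosing an OCR mode can be summarized as a small decision helper. The following Python sketch is an illustration only: the function `choose_ocr_mode` and its content keys are hypothetical and are not part of the Direct.Vsd.Library. Only the returned mode names are taken from the Set OCR Mode description, and falling back to Default for unlisted content is an assumption.

```python
# Hypothetical helper: maps a content type and a speed/accuracy preference
# to one of the OCR mode names listed for the Set OCR Mode function.

# Modes that come in a Speed variant and an Accuracy variant.
_PAIRED_MODES = {
    "text": "Text",
    "document": "Document",
    "barcode": "Barcode",
}

# Modes that have a single variant.
_SINGLE_MODES = {
    "business_card": "Business Cards",
    "drawing": "Engineering Drawings",
    "single_block": "One Block",
    "handwriting": "Handwriting",
}


def choose_ocr_mode(content: str, prefer_accuracy: bool = False) -> str:
    """Return the OCR mode name that matches the content type.

    Assumption: unknown content types fall back to "Default".
    """
    if content in _PAIRED_MODES:
        variant = "Accuracy" if prefer_accuracy else "Speed"
        return f"{_PAIRED_MODES[content]} ({variant})"
    if content in _SINGLE_MODES:
        return _SINGLE_MODES[content]
    return "Default"
```

For example, `choose_ocr_mode("document", prefer_accuracy=True)` returns `"Document (Accuracy)"`, which matches the option recommended for documents with tables where accuracy is important.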